04. Noise Reduction

Noise Reduction

We optimize the policy by maximizing the average reward U(\theta) using stochastic gradient ascent. Mathematically, the gradient is given by an average over all possible trajectories,

\nabla_\theta U(\theta) = \overbrace{\sum_\tau P(\tau; \theta)}^{ \begin{matrix} \scriptsize\textrm{average over}\\ \scriptsize\textrm{all trajectories} \end{matrix} } \underbrace{\left( R_\tau \sum_t \nabla_\theta \log \pi_\theta(a_t^{(\tau)}|s_t^{(\tau)}) \right)}_{ \textrm{only one is sampled} }

Even for simple problems there can easily be millions of possible trajectories, and infinitely many for continuous problems.

For practical purposes, we simply sample one trajectory to compute the gradient and update our policy. This makes the estimate very noisy: much of the time, the outcome of a single sampled trajectory comes down to chance and carries little information about our policy. How does learning happen then? The hope is that, over many updates, the tiny signal accumulates.
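As a concrete picture of this one-sample estimate, here is a minimal PyTorch-style sketch. The policy network, the gym-style environment interface, and the function name are assumptions made for illustration, not code from this lesson:

import torch

def single_trajectory_loss(policy, env, max_steps=1000):
    # Roll out ONE trajectory and build the surrogate objective
    # R_tau * sum_t log pi_theta(a_t | s_t).
    # Backpropagating its negative gives the noisy one-sample
    # policy-gradient estimate described above.
    log_probs = []        # log pi_theta(a_t | s_t) at every step
    total_reward = 0.0    # R_tau, the total reward of the trajectory

    state = env.reset()
    for _ in range(max_steps):
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))

        state, reward, done, _ = env.step(action.item())
        total_reward += reward
        if done:
            break

    # Negative sign because optimizers minimize, while we want gradient ascent on U(theta).
    return -total_reward * torch.stack(log_probs).sum()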

The easiest option to reduce the noise in the gradient is to simply sample more trajectories! Using distributed computing, we can collect multiple trajectories in parallel, so that it won’t take too much time. Then we can estimate the policy gradient by averaging across all the different trajectories

\left. \begin{matrix} s^{(1)}_t, a^{(1)}_t, r^{(1)}_t\\[6pt] s^{(2)}_t, a^{(2)}_t, r^{(2)}_t\\[6pt] s^{(3)}_t, a^{(3)}_t, r^{(3)}_t\\[6pt] \vdots \end{matrix} \;\; \right\}\!\!\!\! \rightarrow g = \frac{1}{N}\sum_{i=1}^N R_i \sum_t\nabla_\theta \log \pi_\theta(a^{(i)}_t | s^{(i)}_t)
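In code, the only change is to average the same surrogate objective over a batch of trajectories. This sketch reuses the hypothetical single_trajectory_loss helper above; envs stands for N independent copies of the environment (in practice the rollouts would be collected by parallel workers):

def batched_loss(policy, envs):
    # Estimate of g: (1/N) * sum_i R_i * sum_t log pi_theta(a_t^(i) | s_t^(i)),
    # built from one trajectory per environment copy.
    losses = [single_trajectory_loss(policy, env) for env in envs]
    return torch.stack(losses).mean()

# Usage: loss = batched_loss(policy, envs); loss.backward(); optimizer.step()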

Rewards Normalization

There is another bonus for running multiple trajectories: we can collect all the total rewards and get a sense of how they are distributed.

In many cases, the distribution of rewards shifts as learning happens. A reward of 1 might be really good in the beginning, but really bad after 1000 training episodes.

Learning can be improved if we normalize the rewards as follows, where \mu is the mean and \sigma the standard deviation:

R_i \leftarrow \frac{R_i -\mu}{\sigma} \qquad \mu = \frac{1}{N}\sum_{i=1}^N R_i \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (R_i - \mu)^2}

(When all the R_i are the same, \sigma = 0; in that case we can set all the normalized rewards to 0 to avoid numerical problems.)
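A minimal NumPy sketch of this normalization, including the \sigma = 0 case. The small epsilon in the denominator is an illustrative way to get the same effect as setting the rewards to 0, since the numerator is exactly 0 whenever all rewards are equal:

import numpy as np

def normalize_rewards(rewards, eps=1e-10):
    # Normalize the batch of total rewards R_i to zero mean and unit variance.
    rewards = np.asarray(rewards, dtype=np.float64)
    mu = rewards.mean()
    sigma = rewards.std()
    # When all R_i are equal, rewards - mu is exactly 0, so the result is all zeros.
    return (rewards - mu) / (sigma + eps)

# Example: normalize_rewards([1.0, 1.0, 1.0]) gives array([0., 0., 0.])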

This batch-normalization technique is also used in many other problems in AI (e.g. image classification), where normalizing the input can improve learning.

Intuitively, normalizing the rewards roughly corresponds to picking half the actions to encourage and half to discourage, while also making sure the gradient ascent steps are neither too large nor too small.